Lab 1 - Exploring Table Data

Members:
CSE 5324:Shenxiao Mei, Samuel Lefcourt
CSE 7324: Chenming Cui, Zeqi Guo

1   Introduction

In your own words, give an overview of the dataset

The transfers of athletes always play a huge part in the modern soccer industry. To some soccer clubs, it is a way to improve their performance in their leagues; to others, it might be a way to make profits. This dataset records the top 250 most expensive player transfers each year starting from the 2000-2001 season.

2  Business Understanding

2.1   Goals and Objective

Describe the purpose of the data set you selected (i.e., why and how was this data collected in the first place?).

Player transfers are a crucial act which will usually affect a soccer club's future. Soccer clubs now throw a large amount of money on scouting potential players in the market. However, since soccer is a highly commericial business, every team has a budget that limits their target amount and quality. A middle level team may not be able to afford the transfer fee for a superstar such as Kylian MBappe. This means they have to wait until the transfer fee goes down. Therefore, our goal is to assist teams in predicting when a player's trading value is less than his market value.

Soccer transfers are fluid. For example, Mohamed Salah's market value rose up to 150 million euros after he was transferred for only 42 million euros. Thus, our second goal is to compare a player's market value and transfer fee and make predictions so that a team manager can wait for the best moment to make a transfer. It will be very helpful for the teams with a relatively limited budget and help them find the optimal transfer choices.

2.2   Measure of success

Once you begin modeling, how well would your prediction algorithm need to perform to be considered useful to these third parties?What is the prediction task for your data and why are other third parties interested in the result?

Our algorithm will be used to find the possibility of having a surplus transfer fee; that is, having a transfer fee larger than market value. In the real world, soccer teams and soccer players themselves will be interested in our prediction as third parties. The clubs will be able to identify if they can benefit from the transfer based on our prediction. The player will be able to self-evaluate with our prediction. We can consider a soccer transfer as an "investment". Just like any other financial investment in the world, players are like stocks - you either make or lose money. To measure our success, we use the stats from a real economic market. According to https://www.financial-math.org/blog/2017/03/how-accurate-are-market-forecasters/, a top-ranking market forecaster was 78.7% accurate in determining the state of the market. The next best forecasters had 72.5%, 71.8% and 70.5% accuracy scores. It is true that soccer cannot be perfectly quantified, but transfers are a highly mature market system. Each club has its own private scouting team, where each scout can have an average salary of $36,000 dollars per year. Modern soccer scouts will assess a player by not only watching him, but also by doing data analysis. Our prediction is free and using data only, so we feel 70% prediction accuracy can be our success criteria.

3   Data Understanding

Load the dataset and appropriately define data types. What data type should be used to represent each data attribute? Discuss the attributes collected in the dataset. For datasets with a large number of attributes, only discuss a subset of relevant attributes.

The introduction of all attributes existing in our dataset.

3.1   General Player Info

  • Name(text):The name of the player
  • Age(ordinal):The age of the player
  • Age_range(nominal): The range in which the player's age fall and has four values:
    1. 16-20
    2. 20-23
    3. 23-28
    4. 28-35
  • Market_value(ordinal):The actual value of the player
  • Transfer_fee(ordinal):The transfer cost on the player
  • Position(nominal): Position is where the player play at in the field. There are 13 types of position:
    1. Attacking Midfield
    2. Central Midfield
    3. Defensive Midfield
    4. Left Midfield
    5. Right Midfield
    6. Centre-Back
    7. Left-Back
    8. Right-Back
    9. Centre-Forward
    10. Right Winger
    11. Left Winger
    12. Second Striker
    13. Goalkeeper
  • General_Position(nominal): Genaral position is determined by position and has three values
    1. Forward
    2. Midfield
    3. Back

3.2   Transfer Info

  • Transfer_fee(ordinal):The transfer expense on the player
  • Season(ordinal):The season when the transfer happened
  • League_from(nominal): the league which the old team is in
  • League_to(nominal):the league which the new team is in
  • Team_from(nominal): the player's old team
  • Team_to(nominal):the player's new team
  • Premium_rate(ratio): This attribute is made by us, it is the calculated by Transfer_fee/Market_value. So whenever it is larger than 1, it means the tranfer fee at this time is over the actual market value
  • isPremium(binary): If player's transfer fee is higher than market value, isPremium is set to 1 and vice versa
  • fee_range(nominal): The range in which the transfer fee fall has four values.
    1. 2millions (0-2millions)
    2. 20millions (2-20millions)
    3. 50millions (20-50millions)
    4. 500millions (50-500millions)

3.3   Load Dataset

In [2]:
import pandas as pd
import numpy as np

print('Pandas:', pd.__version__)
print('Numpy:',np.__version__)

df = pd.read_csv('dataset/top250-00-19.csv') # read in the csv file

df.info()
df.head()
df
Pandas: 0.23.4
Numpy: 1.15.4
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4700 entries, 0 to 4699
Data columns (total 10 columns):
Name            4700 non-null object
Position        4700 non-null object
Age             4700 non-null int64
Team_from       4700 non-null object
League_from     4700 non-null object
Team_to         4700 non-null object
League_to       4700 non-null object
Season          4700 non-null object
Market_value    3440 non-null float64
Transfer_fee    4700 non-null int64
dtypes: float64(1), int64(2), object(7)
memory usage: 367.3+ KB
Out[2]:
Name Position Age Team_from League_from Team_to League_to Season Market_value Transfer_fee
0 Luís Figo Right Winger 27 FC Barcelona LaLiga Real Madrid LaLiga 2000-2001 NaN 60000000
1 Hernán Crespo Centre-Forward 25 Parma Serie A Lazio Serie A 2000-2001 NaN 56810000
2 Marc Overmars Left Winger 27 Arsenal Premier League FC Barcelona LaLiga 2000-2001 NaN 40000000
3 Gabriel Batistuta Centre-Forward 31 Fiorentina Serie A AS Roma Serie A 2000-2001 NaN 36150000
4 Nicolas Anelka Centre-Forward 21 Real Madrid LaLiga Paris SG Ligue 1 2000-2001 NaN 34500000
5 Rio Ferdinand Centre-Back 22 West Ham Premier League Leeds Premier League 2000-2001 NaN 26000000
6 Flávio Conceicao Central Midfield 26 Dep. La Coruña LaLiga Real Madrid LaLiga 2000-2001 NaN 25000000
7 Savo Milosevic Centre-Forward 26 Real Zaragoza LaLiga Parma Serie A 2000-2001 NaN 25000000
8 David Trézéguet Centre-Forward 22 Monaco Ligue 1 Juventus Serie A 2000-2001 NaN 23240000
9 Claudio López Centre-Forward 25 Valencia CF LaLiga Lazio Serie A 2000-2001 NaN 23000000
10 Jimmy Floyd Hasselbaink Centre-Forward 28 Atlético Madrid LaLiga2 Chelsea Premier League 2000-2001 NaN 22500000
11 Gerard López Central Midfield 21 Valencia CF LaLiga FC Barcelona LaLiga 2000-2001 NaN 21600000
12 Lucas Centre-Forward 21 Atlético-PR Brazil Stade Rennais Ligue 1 2000-2001 NaN 21300000
13 Pablo Aimar Attacking Midfield 21 River Plate Argentina Valencia CF LaLiga 2000-2001 NaN 21250000
14 Wálter Samuel Centre-Back 22 Boca Juniors Argentina AS Roma Serie A 2000-2001 NaN 20800000
15 Shabani Nonda Centre-Forward 23 Stade Rennais Ligue 1 Monaco Ligue 1 2000-2001 NaN 20000000
16 Robbie Keane Centre-Forward 19 Coventry City Premier League Inter Serie A 2000-2001 NaN 19500000
17 José Mari Centre-Forward 21 Atlético Madrid LaLiga2 AC Milan Serie A 2000-2001 NaN 19000000
18 Jonathan Zebina Centre-Back 21 Cagliari Calcio Serie B AS Roma Serie A 2000-2001 NaN 18400000
19 Émerson Defensive Midfield 24 Bay. Leverkusen 1.Bundesliga AS Roma Serie A 2000-2001 NaN 18000000
20 Tore André Flo Centre-Forward 27 Chelsea Premier League Rangers Scotland 2000-2001 NaN 18000000
21 Sergiy Rebrov Second Striker 26 Dynamo Kyiv Premier Liga Spurs Premier League 2000-2001 NaN 18000000
22 Angelo Peruzzi Goalkeeper 30 Inter Serie A Lazio Serie A 2000-2001 NaN 17900000
23 Diego Tristán Centre-Forward 24 RCD Mallorca LaLiga Dep. La Coruña LaLiga 2000-2001 NaN 17750000
24 Sylvain Wiltord Second Striker 26 G. Bordeaux Ligue 1 Arsenal Premier League 2000-2001 NaN 17500000
25 Fernando Redondo Defensive Midfield 31 Real Madrid LaLiga AC Milan Serie A 2000-2001 NaN 17500000
26 Jardel Centre-Forward 26 FC Porto Liga NOS Galatasaray Süper Lig 2000-2001 NaN 17050000
27 Nuno Gomes Centre-Forward 24 Benfica Liga NOS Fiorentina Serie A 2000-2001 NaN 17000000
28 Sérgio Conceição Right Winger 25 Lazio Serie A Parma Serie A 2000-2001 NaN 17000000
29 Alfonso Centre-Forward 27 Real Betis LaLiga2 FC Barcelona LaLiga 2000-2001 NaN 16500000
... ... ... ... ... ... ... ... ... ... ...
4670 Benedikt Höwedes Centre-Back 30 FC Schalke 04 1.Bundesliga Loko Moscow Premier Liga 2018-2019 9000000.0 5000000
4671 Gonzalo Castro Central Midfield 31 Bor. Dortmund 1.Bundesliga VfB Stuttgart 1.Bundesliga 2018-2019 8000000.0 5000000
4672 Sergi Gómez Centre-Back 26 Celta de Vigo LaLiga Sevilla FC LaLiga 2018-2019 8000000.0 5000000
4673 Adam Masina Left-Back 24 Bologna Serie A Watford Premier League 2018-2019 7000000.0 5000000
4674 Christophe Hérelle Centre-Back 25 Troyes Ligue 2 OGC Nice Ligue 1 2018-2019 6000000.0 5000000
4675 Firmin Mubele Centre-Forward 24 Stade Rennais Ligue 1 Toulouse Ligue 1 2018-2019 6000000.0 5000000
4676 Florent Hadergjonaj Right-Back 23 FC Ingolstadt 2.Bundesliga Huddersfield Premier League 2018-2019 5000000.0 5000000
4677 Nemanja Maksimovic Central Midfield 23 Valencia CF LaLiga Getafe CF LaLiga 2018-2019 5000000.0 5000000
4678 Juan Iturbe Right Winger 25 AS Roma Serie A Club Tijuana Liga MX Apertura 2018-2019 4000000.0 5000000
4679 Sidcley Left-Back 25 Atlético-PR Série A Dynamo Kyiv Premier Liga 2018-2019 4000000.0 5000000
4680 Fabrício Centre-Forward 28 Portimonense Liga NOS Urawa Reds J1 League 2018-2019 4000000.0 5000000
4681 Johan Mojica Left-Back 25 Rayo Vallecano LaLiga Girona FC LaLiga 2018-2019 4000000.0 5000000
4682 Petros Central Midfield 29 São Paulo Série A Nasr Professional League 2018-2019 3000000.0 5000000
4683 Christian Kouamé Centre-Forward 20 Cittadella Serie B Genoa Serie A 2018-2019 2800000.0 5000000
4684 Luca Waldschmidt Centre-Forward 22 Hamburger SV 2.Bundesliga SC Freiburg 1.Bundesliga 2018-2019 2500000.0 5000000
4685 Marc Muniesa Centre-Back 26 Stoke City Championship Girona FC LaLiga 2018-2019 2500000.0 5000000
4686 Pietro Iemmello Centre-Forward 26 Sassuolo Serie A Benevento Serie B 2018-2019 2000000.0 5000000
4687 Felipe Vizeu Centre-Forward 21 Flamengo Série A Udinese Calcio Serie A 2018-2019 1500000.0 5000000
4688 Marco Tumminello Centre-Forward 19 AS Roma Serie A Atalanta Serie A 2018-2019 1000000.0 5000000
4689 Jens Odgaard Centre-Forward 19 Inter Serie A Sassuolo Serie A 2018-2019 800000.0 5000000
4690 Marco Carraro Defensive Midfield 20 Inter Serie A Atalanta Serie A 2018-2019 800000.0 5000000
4691 Jonathan Ikoné Attacking Midfield 20 Paris SG Ligue 1 LOSC Lille Ligue 1 2018-2019 700000.0 5000000
4692 Giangiacomo Magnani Centre-Back 22 Juventus Serie A Sassuolo Serie A 2018-2019 600000.0 5000000
4693 Giangiacomo Magnani Centre-Back 22 Perugia Serie B Juventus Serie A 2018-2019 600000.0 5000000
4694 Evan N'Dicka Centre-Back 18 AJ Auxerre Ligue 2 E. Frankfurt 1.Bundesliga 2018-2019 400000.0 5000000
4695 Jasmin Kurtic Attacking Midfield 29 Atalanta Serie A SPAL Serie A 2018-2019 5000000.0 4800000
4696 Tchê Tchê Central Midfield 25 Palmeiras Série A Dynamo Kyiv Premier Liga 2018-2019 3000000.0 4800000
4697 Silvan Widmer Right-Back 25 Udinese Calcio Serie A FC Basel Super League 2018-2019 8500000.0 4500000
4698 Yuya Osako Second Striker 28 1. FC Köln 2.Bundesliga Werder Bremen 1.Bundesliga 2018-2019 4500000.0 4500000
4699 Kyle Bartley Centre-Back 27 Swansea Championship West Brom Championship 2018-2019 3500000.0 4500000

4700 rows × 10 columns

3.4   Data Quality

Verify data quality: Explain any missing values or duplicate data. Visualize entries that are missing/complete for different attributes. Are those mistakes? Why do these quality issues exist in the data? How do you deal with these problems? Give justifications for your methods (elimination or imputation).

3.4.1   Duplicate Data

We check if there are duplicate records of the transfer by checking the duplicates of the subset called dup; consisting of the name, season and new team. According to the results, we see there are no duplicate records in the data.

In [3]:
dup=['Name','Season','Team_to']
print(df.duplicated(subset=dup, keep=False))
0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
4670    False
4671    False
4672    False
4673    False
4674    False
4675    False
4676    False
4677    False
4678    False
4679    False
4680    False
4681    False
4682    False
4683    False
4684    False
4685    False
4686    False
4687    False
4688    False
4689    False
4690    False
4691    False
4692    False
4693    False
4694    False
4695    False
4696    False
4697    False
4698    False
4699    False
Length: 4700, dtype: bool

3.4.2   Unnecessary Column

Once we finished checking the existence of duplicates, we delete unnecessary columns. In our case, we delete the 'Name' column because it is not likely to have an impact on the transfer fee or transfer frequency.

In [4]:
if 'Name' in df:
    del df['Name']
df
Out[4]:
Position Age Team_from League_from Team_to League_to Season Market_value Transfer_fee
0 Right Winger 27 FC Barcelona LaLiga Real Madrid LaLiga 2000-2001 NaN 60000000
1 Centre-Forward 25 Parma Serie A Lazio Serie A 2000-2001 NaN 56810000
2 Left Winger 27 Arsenal Premier League FC Barcelona LaLiga 2000-2001 NaN 40000000
3 Centre-Forward 31 Fiorentina Serie A AS Roma Serie A 2000-2001 NaN 36150000
4 Centre-Forward 21 Real Madrid LaLiga Paris SG Ligue 1 2000-2001 NaN 34500000
5 Centre-Back 22 West Ham Premier League Leeds Premier League 2000-2001 NaN 26000000
6 Central Midfield 26 Dep. La Coruña LaLiga Real Madrid LaLiga 2000-2001 NaN 25000000
7 Centre-Forward 26 Real Zaragoza LaLiga Parma Serie A 2000-2001 NaN 25000000
8 Centre-Forward 22 Monaco Ligue 1 Juventus Serie A 2000-2001 NaN 23240000
9 Centre-Forward 25 Valencia CF LaLiga Lazio Serie A 2000-2001 NaN 23000000
10 Centre-Forward 28 Atlético Madrid LaLiga2 Chelsea Premier League 2000-2001 NaN 22500000
11 Central Midfield 21 Valencia CF LaLiga FC Barcelona LaLiga 2000-2001 NaN 21600000
12 Centre-Forward 21 Atlético-PR Brazil Stade Rennais Ligue 1 2000-2001 NaN 21300000
13 Attacking Midfield 21 River Plate Argentina Valencia CF LaLiga 2000-2001 NaN 21250000
14 Centre-Back 22 Boca Juniors Argentina AS Roma Serie A 2000-2001 NaN 20800000
15 Centre-Forward 23 Stade Rennais Ligue 1 Monaco Ligue 1 2000-2001 NaN 20000000
16 Centre-Forward 19 Coventry City Premier League Inter Serie A 2000-2001 NaN 19500000
17 Centre-Forward 21 Atlético Madrid LaLiga2 AC Milan Serie A 2000-2001 NaN 19000000
18 Centre-Back 21 Cagliari Calcio Serie B AS Roma Serie A 2000-2001 NaN 18400000
19 Defensive Midfield 24 Bay. Leverkusen 1.Bundesliga AS Roma Serie A 2000-2001 NaN 18000000
20 Centre-Forward 27 Chelsea Premier League Rangers Scotland 2000-2001 NaN 18000000
21 Second Striker 26 Dynamo Kyiv Premier Liga Spurs Premier League 2000-2001 NaN 18000000
22 Goalkeeper 30 Inter Serie A Lazio Serie A 2000-2001 NaN 17900000
23 Centre-Forward 24 RCD Mallorca LaLiga Dep. La Coruña LaLiga 2000-2001 NaN 17750000
24 Second Striker 26 G. Bordeaux Ligue 1 Arsenal Premier League 2000-2001 NaN 17500000
25 Defensive Midfield 31 Real Madrid LaLiga AC Milan Serie A 2000-2001 NaN 17500000
26 Centre-Forward 26 FC Porto Liga NOS Galatasaray Süper Lig 2000-2001 NaN 17050000
27 Centre-Forward 24 Benfica Liga NOS Fiorentina Serie A 2000-2001 NaN 17000000
28 Right Winger 25 Lazio Serie A Parma Serie A 2000-2001 NaN 17000000
29 Centre-Forward 27 Real Betis LaLiga2 FC Barcelona LaLiga 2000-2001 NaN 16500000
... ... ... ... ... ... ... ... ... ...
4670 Centre-Back 30 FC Schalke 04 1.Bundesliga Loko Moscow Premier Liga 2018-2019 9000000.0 5000000
4671 Central Midfield 31 Bor. Dortmund 1.Bundesliga VfB Stuttgart 1.Bundesliga 2018-2019 8000000.0 5000000
4672 Centre-Back 26 Celta de Vigo LaLiga Sevilla FC LaLiga 2018-2019 8000000.0 5000000
4673 Left-Back 24 Bologna Serie A Watford Premier League 2018-2019 7000000.0 5000000
4674 Centre-Back 25 Troyes Ligue 2 OGC Nice Ligue 1 2018-2019 6000000.0 5000000
4675 Centre-Forward 24 Stade Rennais Ligue 1 Toulouse Ligue 1 2018-2019 6000000.0 5000000
4676 Right-Back 23 FC Ingolstadt 2.Bundesliga Huddersfield Premier League 2018-2019 5000000.0 5000000
4677 Central Midfield 23 Valencia CF LaLiga Getafe CF LaLiga 2018-2019 5000000.0 5000000
4678 Right Winger 25 AS Roma Serie A Club Tijuana Liga MX Apertura 2018-2019 4000000.0 5000000
4679 Left-Back 25 Atlético-PR Série A Dynamo Kyiv Premier Liga 2018-2019 4000000.0 5000000
4680 Centre-Forward 28 Portimonense Liga NOS Urawa Reds J1 League 2018-2019 4000000.0 5000000
4681 Left-Back 25 Rayo Vallecano LaLiga Girona FC LaLiga 2018-2019 4000000.0 5000000
4682 Central Midfield 29 São Paulo Série A Nasr Professional League 2018-2019 3000000.0 5000000
4683 Centre-Forward 20 Cittadella Serie B Genoa Serie A 2018-2019 2800000.0 5000000
4684 Centre-Forward 22 Hamburger SV 2.Bundesliga SC Freiburg 1.Bundesliga 2018-2019 2500000.0 5000000
4685 Centre-Back 26 Stoke City Championship Girona FC LaLiga 2018-2019 2500000.0 5000000
4686 Centre-Forward 26 Sassuolo Serie A Benevento Serie B 2018-2019 2000000.0 5000000
4687 Centre-Forward 21 Flamengo Série A Udinese Calcio Serie A 2018-2019 1500000.0 5000000
4688 Centre-Forward 19 AS Roma Serie A Atalanta Serie A 2018-2019 1000000.0 5000000
4689 Centre-Forward 19 Inter Serie A Sassuolo Serie A 2018-2019 800000.0 5000000
4690 Defensive Midfield 20 Inter Serie A Atalanta Serie A 2018-2019 800000.0 5000000
4691 Attacking Midfield 20 Paris SG Ligue 1 LOSC Lille Ligue 1 2018-2019 700000.0 5000000
4692 Centre-Back 22 Juventus Serie A Sassuolo Serie A 2018-2019 600000.0 5000000
4693 Centre-Back 22 Perugia Serie B Juventus Serie A 2018-2019 600000.0 5000000
4694 Centre-Back 18 AJ Auxerre Ligue 2 E. Frankfurt 1.Bundesliga 2018-2019 400000.0 5000000
4695 Attacking Midfield 29 Atalanta Serie A SPAL Serie A 2018-2019 5000000.0 4800000
4696 Central Midfield 25 Palmeiras Série A Dynamo Kyiv Premier Liga 2018-2019 3000000.0 4800000
4697 Right-Back 25 Udinese Calcio Serie A FC Basel Super League 2018-2019 8500000.0 4500000
4698 Second Striker 28 1. FC Köln 2.Bundesliga Werder Bremen 1.Bundesliga 2018-2019 4500000.0 4500000
4699 Centre-Back 27 Swansea Championship West Brom Championship 2018-2019 3500000.0 4500000

4700 rows × 9 columns

3.4.3   Vague Data

When we examine the quality of the dataset, we find that the 'player position' column has some vague data as shown below. Check out the positions that only occurs once in the entire dataset. This vague data could also be caused by human mistake or insufficient record. Because there is only one player in these positions, we do not have a sufficient amount of data by which we can make a prediction.

In [5]:
df_grouped=df.groupby(by=['Position'])
for val, grp in df_grouped:
    print(val,'  ', len(grp))
df.info()
Attacking Midfield    426
Central Midfield    487
Centre-Back    714
Centre-Forward    1218
Defender    1
Defensive Midfield    411
Forward    3
Goalkeeper    180
Left Midfield    87
Left Winger    267
Left-Back    225
Midfielder    1
Right Midfield    63
Right Winger    305
Right-Back    181
Second Striker    130
Sweeper    1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4700 entries, 0 to 4699
Data columns (total 9 columns):
Position        4700 non-null object
Age             4700 non-null int64
Team_from       4700 non-null object
League_from     4700 non-null object
Team_to         4700 non-null object
League_to       4700 non-null object
Season          4700 non-null object
Market_value    3440 non-null float64
Transfer_fee    4700 non-null int64
dtypes: float64(1), int64(2), object(6)
memory usage: 330.5+ KB
In [6]:
df = df[df.Position != 'Forward']
df = df[df.Position != 'Defender']
df = df[df.Position != 'Sweeper']
df = df[df.Position != 'Midfielder']

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4694 entries, 0 to 4699
Data columns (total 9 columns):
Position        4694 non-null object
Age             4694 non-null int64
Team_from       4694 non-null object
League_from     4694 non-null object
Team_to         4694 non-null object
League_to       4694 non-null object
Season          4694 non-null object
Market_value    3440 non-null float64
Transfer_fee    4694 non-null int64
dtypes: float64(1), int64(2), object(6)
memory usage: 366.7+ KB

3.4.4   Missing Data

With no duplicates, however, we noticed that there is a significant amount of data missing in market value of the player. The missing data mostly occurs at very early seasons. The incomplete player assessing system might be responsible for this. That means, twenty years ago, there may not have been an accurate system to evaluate a player's market value based on their stats. And there are still some data missing happened in more recent years, beacause some of players donot have huge reputation, or just attend few days so those data cannot be collected successfully.

In [7]:
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline 

import missingno as mn

mn.matrix(df.sort_values(by='Season'))
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a23831898>

Explanation:

The rows that have NaN in their market value will be dropped because they could not have been imputed in the future.

In [8]:
df = df[np.isfinite(df['Market_value'])]

import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline 

import missingno as mn

mn.matrix(df.sort_values(by=['Season']))
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a23b35208>

With all rows containing missing data dropped, we can now calculate the premium rate based on market value and transfer fee. This additional column is to help describe the gain and loss of the transfer.

In [9]:
df['premium_rate'] = (df.Transfer_fee/df.Market_value)
df.head()
Out[9]:
Position Age Team_from League_from Team_to League_to Season Market_value Transfer_fee premium_rate
999 Second Striker 20 Boca Juniors Argentina Corinthians Série A 2004-2005 15000000.0 15000000 1.000000
1001 Centre-Forward 18 Lecce Serie A Fiorentina Serie A 2004-2005 13000000.0 14000000 1.076923
1004 Centre-Back 27 CSKA Moscow Premier Liga Chelsea Premier League 2004-2005 5000000.0 12000000 2.400000
1009 Centre-Back 25 Rangers Premiership Newcastle Premier League 2004-2005 8500000.0 11300000 1.329412
1011 Centre-Forward 25 Man City Premier League Fenerbahce Süper Lig 2004-2005 15000000.0 10700000 0.713333


4   Data Analysis / Visualization

Visualize attribute distributions. Choose and visualize distributions for a subset of single attributes. Choose any appropriate visualization such as histograms, kernel density estimation, box plots, etc. Describe anything meaningful or potentially interesting you discover from these visualizations. Note: You can also use data from other sources to bolster visualizations. Visualize at least 5 attributes, at least one categorical and at least one numeric.

Visualize relationships between a subset of attributes. Use whichever visualization method is appropriate for your data. Explain any interesting relationships. Important: Interpret the implications for each visualization. Visualize at least three subsets of the attributes.

4.1   Attribute Distribution Visualization

4.1.1 &nbsp Transfer ferquency VS Transfer fee & Transfer frequency VS Age

In [10]:
plt.style.use('ggplot')

plt.figure(figsize=(20, 10))

plt.subplot(1,2,1)
plt.xlabel('Transfer_fee')
plt.ylabel('Number')
plt.title('Transfer frequency vs Transfer fee', fontsize=18)
df.Transfer_fee.plot.hist(bins=100,  color='#607c8e',logx=True)

plt.subplot(1,2,2)
plt.xlabel('Age')
plt.ylabel('Number')
plt.xlim([15,36])
plt.title('Transfer frequency vs Age', fontsize=18)
df.Age.plot.hist(bins=20, color='#607c8e')
df.Age.plot.kde(bw_method=0.2, secondary_y=True)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2760d6d8>

The two plots are just as one would expect. Usually a player's transfer fee is around 10 million to 30 million euros because that is what a middle class team can afford to pay a single player. Notice the one dot around 200 million, that's Neymar's transfer, the most expensive transfer in the soccer history. However, that does not happen very often. On the right side we have a plot in terms of transfer frequency versus age. 25-year-old players tend to have the most transfer frequency because they are more mature than younger players and have less injury than older players. Meanwhile, they are experienced and know how to play in a system.

4.1.2   League that has more than 20 top 250 transfer

In [12]:
# df_new=df.groupby('League_from').size()
df_League=df[df['League_from'].groupby(df['League_from']).transform('size')>20].groupby('League_from').size()

df_League.plot(kind='barh', fontsize=14, figsize=(10,6))
plt.xlabel('Number of top 250 transfer', fontsize=14)
plt.title('League that has more than 20 top 250 transfer', fontsize=18)
Out[12]:
Text(0.5, 1.0, 'League that has more than 20 top 250 transfer')

Not surprisingly, Serie A from Italy, Premier League from England, Ligue 1 from France, LaLiga from Spain, 1.Bundesliga from Germany frequently participate in trades involved with high transfer fees. These five leagues have the highest audience ratings and the most complete soccer systems with cantera(a term used in Spain to refer to youth academies and farm teams) and policies. In the past decade we have seen Serie A and Premier League compete for the top rank in most transfers. The picture below is provided by https://www.sbnation.com/soccer/2014/7/28/5923187/transfer-window-soccer-europe-explained image.png

4.1.3   Positions that have top 250 transfer

In [13]:
df_Position_freq=df.groupby('Position').size()

df_Position_freq.plot(kind='barh', fontsize=14, figsize=(10,6))
plt.xlabel('Number of top250 transfer', fontsize=14)
plt.title('Positions that have top 250 transfer', fontsize=18)
Out[13]:
Text(0.5, 1.0, 'Positions that have top 250 transfer')

With no doubt, Centre-Forward, the main attacker, is the most important position in the field. Sometimes a good striker can help the team win the match singlehandedly. On the other side of the field, Centre-Back, directly playing against opposing forwarders, are also considered valueable. That's why centre-back is the second most frequent position in top 250 transfers.

In [14]:
# df_Season_fee=df[['Season', 'Transfer_fee']]
# df_Season_fee=df_Season_fee.groupby('Season')['Transfer_fee'].mean()

df_Season_fee=df[df.Transfer_fee >30000000]
df_Season_fee=df_Season_fee.groupby('Season').size()

ax = df_Season_fee.plot(kind='bar',fontsize=14, figsize=(10,8),color='y')
plt.title('Transfee fee over 30 million')
plt.ylabel('Million', fontsize=14)

plt.show()

The soccer player market price is increasing over time. Note that 2018-2019 is way lower than 2017-2018. That's because Neymar's transfer significantly affected the market.

4.2   Attribute Relationship Visualization

4.2.1 Average transfer fee and market value based on position

In [15]:
df['fee_range'] = pd.cut(df['Transfer_fee'],[0,2e6,2e7,5e7,5e8],
                                 labels=['2millions','20millions','50millions','500millions'])
df['Age_range'] = pd.cut(df['Age'],[16,20,23,28,35],
                                 labels=['16-20','20-23','23-28','28-35'])
df
def conditions(s):
    if (s['Position'] == 'Second Striker') or (s['Position'] == 'Centre-Forward') or (s['Position'] == 'Right Winger') or (s['Position'] == 'Left Winger'):
        return 'Forward'
    elif (s['Position'] == 'Attacking Midfield' or s['Position'] == 'Left Midfield' or s['Position'] == 'Defensive Midfield' or s['Position'] == 'Right Midfield'):
        return 'Midfield'
    else:
        return 'Back'
df['General_Position']=df.apply(conditions, axis=1)
df_new1=df.groupby(by=['Age_range','General_Position'])['Transfer_fee','Market_value'].mean()
ax = df_new1.plot(kind='barh',fontsize=14, figsize=(10,8))
plt.title('Average transfer fee and market value based on position')
plt.show()

From the above graph we can see that transfer fees are mostly higher than market value regardless of the player's position. This is usually caused by the liquidated damages in player contracts. Players playing forwards have the largest market value and transfer fee.

4.2.2   Average Premium Rate based on Age

In [16]:
df_prem_age=df[['Age','premium_rate']].groupby('Age')['premium_rate'].mean()

fig = plt.figure(figsize=(15,5))
ax = df_prem_age.plot(kind='bar',fontsize=14, figsize=(15,8),color='k')
ax.axhline(y=1)
plt.title('Average Premium Rate based on Age')
plt.ylabel('Premium_rate')
plt.show()

If a bar fails to pass the red line, then the transfer fee is lower than the market value. It has become standard for soccer clubs to buy very young player with huge potential. In fact, it is not just the potential that causes a huge premium rate. A young player usually has less salary, and can recover fast if injured. When players reach a certain age, such as 34, their strength, as well as stamina, decreases. Clubs are less likely to buy older players with a price over market value.

4.2.3   Average Premium Rate based on League

In [17]:
df_league_mean = df[['League_from','premium_rate']].groupby('League_from').mean()
df_league = pd.DataFrame(df_league_mean)
rate = df_league['premium_rate'][['Serie A', 'Premier League', 'LaLiga', 'Ligue 1', '1.Bundesliga']]
new_rate = np.array(rate)
name = ['Serie A', 'Premier League', 'LaLiga', 'Ligue 1', '1.Bundesliga']
new_table = pd.DataFrame(new_rate,name)

ax = new_table.plot(kind='bar',fontsize=14, figsize=(10,8),color='y')
ax.axhline(y=1)
plt.title('Average Premium Rate based on League')
plt.xlabel('League name', fontsize=14)
plt.ylabel('average premium rate', fontsize=14)
plt.show()

This picture above is about the average premium rate of the top 5 league in this rank list. The league Ligue 1 have the highest premium rate because teams mainly purchase young players with a perfect pontential, not worth the price they pay. Bundesliga have the lowest rate because Bundeslige has a tradition of controlling of budget, so they always purchase some players who have a stable performance, not those who may have high potential.

4.2.4   is Premium or not

In [18]:
import seaborn as sns
cmap = sns.diverging_palette(220, 10, as_cmap=True)

fig = plt.figure(figsize=(12,5))
df['isPremium'] = np.where(df['premium_rate']>1, 1, 0)
sns.violinplot(x="fee_range", y="Age", hue="isPremium", data=df, 
               split=True, inner="quart")
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a27c3f0f0>

The figure above has two features and depicts the x-axis as the fee range and the y-axis as the age range.

  • Players from four different transfer fee ranges all share the same trait: The older a player, the larger his chance of having a premium rate less than 1. It is easy to understand that as a player gets older, his potential and health deteriorate. Therefore, not many clubs will buy such player for a price larger than market value.
  • From these four ranges, we see that no matter what transfer fee range a player is in, the possibility of premium is fairly equal.

4.2.5   Information about Two Different Age Range

In [19]:
# plot the correlation matrix 
vars_to_use = ['Age','Market_value', 'Transfer_fee','premium_rate'] # pick vars
df_young_age=df[df.Age<25]
df_old_age=df[df.Age>31]

plt.figure(figsize=(30, 10))
plt.subplot(1,2,1)
plt.pcolor(df_young_age[vars_to_use].corr()) # do the feature correlation plot
# fill in the indices
plt.yticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use)
plt.xticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use)
plt.colorbar()
plt.subplot(1,2,2)
plt.pcolor(df_old_age[vars_to_use].corr()) # do the feature correlation plot
# fill in the indices
plt.yticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use)
plt.xticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use)
plt.colorbar()
plt.show()

Soccer transfer has a feature: If a player is under 20, his market value tends to increase; if a player is above 30, his market value tends to decrease. Therefore, we made two plots. On the left is the correlation matrix for young players; On the right is the correlation matrix for players older than thirty. Since 26-30 years old are the golden ages of a player, so they in that range, age has nothing to do with the market value

From the plots we see that market value and transfer fee are definitely positive correlated. Like I have stated above, the older the player gets, the less market value a player will have.

5   EXCEPTIONAL WORK

You have free reign to provide any additional analyses. One idea (required for 7000 level students): implement dimensionality reduction using t-SNE, then visualize and interpret the results. Give an explanation of t-SNE dimensionality reduction methods.

In [20]:
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.offline as pyoff


print('plotly', plotly.__version__)

plotly.offline.init_notebook_mode() # run at the start of every notebook
plotly 3.5.0
In [21]:
%%time

from sklearn.manifold import TSNE
from sklearn.datasets import load_iris,load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X = np.array(df[['Age','Market_value','Transfer_fee']])

tsne = TSNE(n_components=2)
tsne.fit_transform(X)
print(tsne.embedding_)
[[-40.211243  -22.299408 ]
 [-42.73151   -11.380989 ]
 [-12.888324    2.405125 ]
 ...
 [ 17.219772   66.07616  ]
 [ 45.020473   14.418088 ]
 [ 46.215538   -4.9652066]]
CPU times: user 20.1 s, sys: 1.37 s, total: 21.4 s
Wall time: 21.4 s
In [22]:
%%time

tsne = TSNE(n_components=2, init='pca', random_state=0)
Y = tsne.fit_transform(X)
CPU times: user 36.3 s, sys: 341 ms, total: 36.6 s
Wall time: 35.7 s
In [25]:
colors1 = X[:, 0]

fig = go.Figure()
fig.add_scatter(x=Y[:, 0],
                y=Y[:, 1],
                mode='markers',
                marker={'color': colors1,
                        'opacity': 1.0,
                        'colorscale': 'Viridis'
                       });
# pyoff.iplot(fig)
plotly.offline.iplot(fig)

t-sne is a machine learning algorithm for visualization used for high-level representations. Thus, it's a good way to decrease the high dimensions to two or three dimensions. In our datasets, we selected attributes "Age", "Market_value" and "Transfer_fee". After t-sne we used plotly to visualize the result and colored the graph according to the instance's "Age" attribute. For the x axis and y axis of the graph are more like to mean nothing and t-sne is just a visualization technique, it will be important to observe the distribution.(quote: https://stats.stackexchange.com/questions/254090/what-are-the-axes-of-a-t-sne-scatterplot) Our resulting graph shows the upper bound as purple, which corresponds to younger athletes. Green correlates with older athletes. This solidfies our belief that athletes' ages truly affect their trading value.

In [26]:
colors2 = X[:, 2]

fig2 = go.Figure()
fig2.add_scatter(x=Y[:, 0],
                y=Y[:, 1],
                mode='markers',
                marker={'color': colors2,
                        'opacity': 1.0,
                        'colorscale': [[0,'green'],[0.2,'red'],[0.4,'yellow'],[0.6,'blue'],[1.0,'pink']]
                       });
# pyoff.iplot(fig2)
plotly.offline.iplot(fig2)

This graph is colored by "Transfer_fee" attribute. Most parts of the graph is green and red, which are considered lower. Only a few points are colored as pink and blue. Thus, only a few athletes can get high transfer fee.

In [27]:
colors3 = np.array(df['isPremium'])

fig3 = go.Figure()
fig3.add_scatter(x=Y[:, 0],
                y=Y[:, 1],
                mode='markers',
                marker={'color': colors3,
                        'opacity': 1.0,
                        'colorscale': 'Viridis'
                       });

# pyoff.iplot(fig3)
plotly.offline.iplot(fig3)

This graph is colored by "isPremium" attribute. "isPremium" is a boolean value, so the graph seperated into two parts.

When we see these three graphs together, we can find that most young athletes may not be qualified for high transfer fees, but their transfer fees are always higher than their market values. The older an athlete is, the less likely his transfer fee is higher than his market value. What's more, the highest transfer fee an athlete can get always comes between 25-30 years old.